Introduction to R language

Dae-Jin Lee < dlee@bcamath.org >

BCAM-UPV/EHU Courses 2018-2019

Install and load an R library

Install a package

install.packages("DAAG") # (Data Analysis And Graphics)

or several packages

install.packages(c("lattice","HSAUR2","Hmisc","psych","foreign","xlsx","maps","maptools","RColorBrewer","calibrate"))

In Rstudio (go to Packages and click Install)

Once installed the package, load it

library(DAAG) # or require(DAAG)

Start with R

getwd() 
ls()
setwd("/Users/dlee") 
history() # display last 25 commands
history(max.show=Inf) # display all previous commands
savehistory(file="myfile") # default is ".Rhistory"
loadhistory(file="myfile") # default is ".Rhistory"
save.image()
save(<object list>,file="myfile.RData") 
load("myfile.RData") 
q()

Reading data in R

The R console

x <- c(7.82,8.00,7.95) # c means "combine"
x
## [1] 7.82 8.00 7.95

A quicker way is to use scan()

x <- scan()  # enter a number followed by return and blank line to end
1: 7.82
2: 8.00
3: 7.95
4: 
Read 3 items

To create a character vector use ""

id <- c("John","Paul","George","Ringo")

To read a character vector

id <- scan(,"")
1: John
2: Paul
3: George
4: Ringo
5: 
Read 4 items  
id
## [1] "John"   "Paul"   "George" "Ringo"

Data Import

In most situations, we need to read data from a separate data file. There are several methods for doing this.

cat("Example:", "2 3 5 7", "11 13 17", file = "ex.txt", sep = "\n") # creates ex.txt
scan("ex.txt", skip = 1)
## [1]  2  3  5  7 11 13 17
scan("ex.txt", skip = 1, nlines = 1) # only 1 line after the skipped one
## [1] 2 3 5 7
unlink("ex.data") # tidy up
library(gdata)
library(foreign)

Create a folder, name it data and download cars data (cardata.zip)

mydata1 = read.table("data/cardata.txt") 
mydata2 = read.csv("data/cardata.csv")  
library(gdata)
mydata3 = read.xls("data/cardata.xls", sheet = 1, header = TRUE)

library(xlsx)
mydata4 = read.xlsx("data/cardata.xlsx", sheetIndex = 1, header = TRUE,colClasses=NA)
library(foreign)                   
mydata = read.mtp("mydata.mtp")  # Minitab
mydata = read.spss("myfile", to.data.frame=TRUE) # SPSS
mydata = read.dta("mydata.dta") # Stata
library(Hmisc)
mydata = spss.get("mydata.por", use.value.labels=TRUE)  # SPSS

Exporting data

mtcars
?mtcars    
write.table(mtcars, "cardata.txt", sep="\t") 
library(xlsx)
write.xlsx(mydata, "mydata.xlsx")

Data vectors

weight<-c(60,72,57,90,95,72)  
class(weight)
## [1] "numeric"
height<-c(1.75,1.80,1.65,1.90,1.74,1.91)
bmi<- weight/height^2
bmi
## [1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630

Basic statistics

mean(weight) 
median(weight)
sd(weight)
var(weight)
summary(weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   63.00   72.00   74.33   85.50   95.00
min(weight)
max(weight)
range(weight)
sum(weight)
length(weight)

There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.

quantile(weight)
##   0%  25%  50%  75% 100% 
## 57.0 63.0 72.0 85.5 95.0

The \(n^{\rm th}\) percentile of an observation variable is the value that cuts off the first \(n\) percent of the data values when it is sorted in ascending order.

quantile(weight,c(0.32,0.57,0.98))
##  32%  57%  98% 
## 67.2 72.0 94.5

The covariance of two variables \(x\) and \(y\) in a data sample measures how the two are linearly related. A positive covariance would indicate a positive linear relationship between the variables, and a negative covariance would indicate the opposite.

\[ \rm{Cov}(x,y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \]

cov(weight,height)
## [1] 0.6773333

The correlation coefficient of two variables in a data sample is their covariance divided by the product of their standard deviations. It is a normalised measurement of how the two are linearly related.

Formally, the sample correlation coefficient is defined by the following formula, where \(\sigma_x\) and \(\sigma_y\) are the sample standard deviations, and \(\sigma_xy\) is the covariance.

\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x~\sigma_y} \]

cor(weight,height)
## [1] 0.437934

Creating your own functions in R

One of the great strengths of R is the user's ability to add functions. In fact, many of the functions in R are actually functions of functions. The structure of a function is given below.

myfunction <- function(arg1, arg2, ... ){
  statements
return(object)
}
f <- function(x){
  x^2
    }
f
## function(x){
##   x^2
##     }

Example:

# Given a number
f(2)
## [1] 4
# Given a vector
x <- c(1,2,-4,7)
f(x)
## [1]  1  4 16 49

Let us create a function that returns a set of summary statistics given a numeric vector:

mysummary <- function(x){
  mean <- sum(x)/length(x)
   var <- var(x)
    sd <- sd(x)
 range <- range(x)
 result <- list(mean=mean,var=var,sd=sd,range=range)
 return(result)
}

Then

set.seed(1234)
x <- rnorm(10)
stats <- mysummary(x)
stats
## $mean
## [1] -0.3831574
## 
## $var
## [1] 0.9915928
## 
## $sd
## [1] 0.9957875
## 
## $range
## [1] -2.345698  1.084441

Character vectors and factor variables

subject <- c("John","Peter","Chris","Tony","Mary","Jane")
sex <- c("MALE","MALE","MALE","MALE","FEMALE","FEMALE")
class(subject)
## [1] "character"
table(sex)
## sex
## FEMALE   MALE 
##      2      4

Data frames

A data.frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Dat <- data.frame(subject,sex,weight,height)
# add bmi to Dat
Dat$bmi <- bmi  # or Dat$bmi <- weight/height^2
class(Dat)
## [1] "data.frame"
str(Dat) # display object structure
## 'data.frame':    6 obs. of  5 variables:
##  $ subject: Factor w/ 6 levels "Chris","Jane",..: 3 5 1 6 4 2
##  $ sex    : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 2 1 1
##  $ weight : num  60 72 57 90 95 72
##  $ height : num  1.75 1.8 1.65 1.9 1.74 1.91
##  $ bmi    : num  19.6 22.2 20.9 24.9 31.4 ...
# Change rownames
rownames(Dat)<-c("A","B","C","D","E","F")

# Access to data frame elements (similar to a matrix)
Dat[,1]     # 1st column
## [1] John  Peter Chris Tony  Mary  Jane 
## Levels: Chris Jane John Mary Peter Tony
Dat[,1:3]   # 1st to 3rd columns
##   subject    sex weight
## A    John   MALE     60
## B   Peter   MALE     72
## C   Chris   MALE     57
## D    Tony   MALE     90
## E    Mary FEMALE     95
## F    Jane FEMALE     72
Dat[1:2,]   # 1st to 2nd row
##   subject  sex weight height      bmi
## A    John MALE     60   1.75 19.59184
## B   Peter MALE     72   1.80 22.22222

Working with data frames

Example: Analyze data by groups

  1. Select each group and compute the mean
Dat[sex=="MALE",]
Dat[sex=="FEMALE",]

mean(Dat[sex=="MALE",3])  # weight average of MALEs
mean(Dat[sex=="MALE","weight"])
  1. Use apply by columns
apply(Dat[sex=="FEMALE",3:5],2,mean)
apply(Dat[sex=="MALE",3:5],2,mean)

# we can use apply with our own function
apply(Dat[sex=="FEMALE",3:5],2,function(x){x+2})
  1. by and colMeans
# 'by' splits your data by factors and do calculations on each subset.
by(Dat[,3:5],sex, colMeans) 
  1. aggregate
# another option
aggregate(Dat[,3:5], by=list(sex),mean) 

Logical vectors

bmi
bmi>22
as.numeric(bmi>22) # convert a logical condition to a numeric value 0/1
which(bmi>22)  # gives the position of bmi for which bmi>22
bmi > 20 & bmi < 25
which(bmi > 20 & bmi < 25)

Working with vectors

x <- c(2, 3, 5, 2, 7, 1)
y <- c(10, 15, 12)
z <- c(x,y)  # concatenates x and y
zz <- list(x,y) # create a list
unlist(zz) # unlist the list converting it to a concatenated vector
## [1]  2  3  5  2  7  1 10 15 12
x[c(1,3,4)]
## [1] 2 5 2
x[-c(2,6)] # negative subscripts omit the chosen elements 
## [1] 2 5 2 7
seq(1,9) # or 1:9
## [1] 1 2 3 4 5 6 7 8 9
seq(1,9,by=1)
## [1] 1 2 3 4 5 6 7 8 9
seq(1,9,by=0.5)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0
seq(1,9,length=20)
##  [1] 1.000000 1.421053 1.842105 2.263158 2.684211 3.105263 3.526316
##  [8] 3.947368 4.368421 4.789474 5.210526 5.631579 6.052632 6.473684
## [15] 6.894737 7.315789 7.736842 8.157895 8.578947 9.000000
oops <- c(7,9,13)
rep(oops,3) # repeats the entire vector "oops" three times
rep(oops,1:3) # this function has the number 3 replaced 
              #  by a vector with the three values (1,2,3) 
              #  indicating that 7 should be repeated once, 9 twice and 13 three times.

rep(c(2,3,5), 4)
rep(1:2,c(10,15))

rep(c("MALE","FEMALE"),c(4,2)) # it also works with character vectors 
c(rep("MALE",3), rep("FEMALE",2))

Matrices and arrays

x<- 1:12
x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
dim(x)<-c(3,4)  # 3 rows and 4 columns

X <- matrix(1:12,nrow=3,byrow=TRUE)
X
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
X <- matrix(1:12,nrow=3,byrow=FALSE)
X
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
# rownames, colnames

rownames(X) <- c("A","B","C")
X
##   [,1] [,2] [,3] [,4]
## A    1    4    7   10
## B    2    5    8   11
## C    3    6    9   12
colnames(X) <- LETTERS[4:7]
X
##   D E F  G
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
colnames(X) <- month.abb[4:7]
X
##   Apr May Jun Jul
## A   1   4   7  10
## B   2   5   8  11
## C   3   6   9  12
Y <- matrix(0.1*(1:12),3,4)

cbind(X,Y)  # bind column-wise
##   Apr May Jun Jul                
## A   1   4   7  10 0.1 0.4 0.7 1.0
## B   2   5   8  11 0.2 0.5 0.8 1.1
## C   3   6   9  12 0.3 0.6 0.9 1.2
rbind(X,Y)  # bind row-wise
##   Apr May Jun  Jul
## A 1.0 4.0 7.0 10.0
## B 2.0 5.0 8.0 11.0
## C 3.0 6.0 9.0 12.0
##   0.1 0.4 0.7  1.0
##   0.2 0.5 0.8  1.1
##   0.3 0.6 0.9  1.2

Factors

`Factors are the data objects which are used to categorize the data and store it as levels.

gender<-c(rep("female",691),rep("male",692))
is.factor(gender)
## [1] FALSE
class(gender)
## [1] "character"
# change vector to factor (i.e. a category)
gender<- factor(gender)
is.factor(gender)
## [1] TRUE
levels(gender)
## [1] "female" "male"
summary(gender)
## female   male 
##    691    692
table(gender)
## gender
## female   male 
##    691    692
status<- c(0,3,2,1,4,5)    # This command creates a numerical vector pain, 
                           #    encoding the pain level of five patients.
fstatus <- factor(status, levels=0:5)
levels(fstatus) <- c("student","engineer","unemployed","lawyer","economist","dentist")

Dat$status <- fstatus
Dat
##   subject    sex weight height      bmi     status
## A    John   MALE     60   1.75 19.59184    student
## B   Peter   MALE     72   1.80 22.22222     lawyer
## C   Chris   MALE     57   1.65 20.93664 unemployed
## D    Tony   MALE     90   1.90 24.93075   engineer
## E    Mary FEMALE     95   1.74 31.37799  economist
## F    Jane FEMALE     72   1.91 19.73630    dentist
data <- c("East","West","East","North","North","East","West",
   "West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)
##  [1] East  West  East  North North East  West  West  West  East  North
## Levels: East North West
# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
##  [1] East  West  East  North North East  West  West  West  East  North
## Levels: East West North

Indexing vector with logicals

a <- c(1,2,3,4,5)
b <- c(TRUE,FALSE,FALSE,TRUE,FALSE)

max(a[b])
## [1] 4
sum(a[b])
## [1] 5

Missing values

In R, missing values are represented by the symbol NA (not available) . Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number).

a <- c(1,2,3,4,NA)
sum(a)
## [1] NA

Excluding missing values from functions

sum(a,na.rm=TRUE)
## [1] 10
a <- c(1,2,3,4,NA)
is.na(a)
## [1] FALSE FALSE FALSE FALSE  TRUE

The function complete.cases() returns a logical vector indicating which cases are complete.

complete.cases(a)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE

The function na.omit() returns the object with listwise deletion of missing values.

na.omit(a) 
## [1] 1 2 3 4
## attr(,"na.action")
## [1] 5
## attr(,"class")
## [1] "omit"

NA in data frames:

require(graphics)
?airquality
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
pairs(airquality, panel = panel.smooth, main = "airquality data")

ok <- complete.cases(airquality)
airquality[ok,]
##     Ozone Solar.R Wind Temp Month Day
## 1      41     190  7.4   67     5   1
## 2      36     118  8.0   72     5   2
## 3      12     149 12.6   74     5   3
## 4      18     313 11.5   62     5   4
## 7      23     299  8.6   65     5   7
## 8      19      99 13.8   59     5   8
## 9       8      19 20.1   61     5   9
## 12     16     256  9.7   69     5  12
## 13     11     290  9.2   66     5  13
## 14     14     274 10.9   68     5  14
## 15     18      65 13.2   58     5  15
## 16     14     334 11.5   64     5  16
## 17     34     307 12.0   66     5  17
## 18      6      78 18.4   57     5  18
## 19     30     322 11.5   68     5  19
## 20     11      44  9.7   62     5  20
## 21      1       8  9.7   59     5  21
## 22     11     320 16.6   73     5  22
## 23      4      25  9.7   61     5  23
## 24     32      92 12.0   61     5  24
## 28     23      13 12.0   67     5  28
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 31     37     279  7.4   76     5  31
## 38     29     127  9.7   82     6   7
## 40     71     291 13.8   90     6   9
## 41     39     323 11.5   87     6  10
## 44     23     148  8.0   82     6  13
## 47     21     191 14.9   77     6  16
## 48     37     284 20.7   72     6  17
## 49     20      37  9.2   65     6  18
## 50     12     120 11.5   73     6  19
## 51     13     137 10.3   76     6  20
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     39      83  6.9   81     8   1
## 94      9      24 13.8   81     8   2
## 95     16      77  7.4   82     8   3
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 104    44     192 11.5   86     8  12
## 105    28     273 11.5   82     8  13
## 106    65     157  9.7   80     8  14
## 108    22      71 10.3   77     8  16
## 109    59      51  6.3   79     8  17
## 110    23     115  7.4   76     8  18
## 111    31     244 10.9   78     8  19
## 112    44     190 10.3   78     8  20
## 113    21     259 15.5   77     8  21
## 114     9      36 14.3   72     8  22
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 129    32      92 15.5   84     9   6
## 130    20     252 10.9   80     9   7
## 131    23     220 10.3   78     9   8
## 132    21     230 10.9   75     9   9
## 133    24     259  9.7   73     9  10
## 134    44     236 14.9   81     9  11
## 135    21     259 15.5   76     9  12
## 136    28     238  6.3   77     9  13
## 137     9      24 10.9   71     9  14
## 138    13     112 11.5   71     9  15
## 139    46     237  6.9   78     9  16
## 140    18     224 13.8   67     9  17
## 141    13      27 10.3   76     9  18
## 142    24     238 10.3   68     9  19
## 143    16     201  8.0   82     9  20
## 144    13     238 12.6   64     9  21
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30

Working with data frames

mtcars
?mtcars       # or help(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars) # display the structure of the data frame
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars["Mazda RX4",] # using rows and columns names
mtcars[c("Datsun 710", "Camaro Z28"),] 
mtcars[,c("mpg","am")]

There are some packages that include particular functions to summarize data frames, for instance the library psych has the function describe

library(psych)
describe(mtcars)

Basic plotting and data analysis in R

data("mtcars")
attach(mtcars) #?attach
plot(wt, mpg, main="Scatterplot Example",
   xlab="Car Weight", ylab="Miles Per Gallon", pch=19) 

pairs(~mpg+disp+drat+wt,data=mtcars,
   main="Simple Scatterplot Matrix")

tab <- table(mtcars[,c("cyl")])
barplot(tab)

pie(tab)

Exercises:

  1. The data.frame VADeaths contains the death rates per 1000 in Virginia (US) in 1940. The death rates are measured per 1000 population per year. They are cross-classified by age group (rows) and population group (columns).
data(VADeaths)
VADeaths
##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0
##  50-54  55-59  60-64  65-69  70-74 
## 11.050 16.925 25.875 40.400 60.350
##   Rural Male Rural Female   Urban Male Urban Female 
##        32.74        25.18        40.48        25.28
  1. The data.frame rainforest contains several variables from different species
library(DAAG)
rainforest
?rainforest
names(rainforest)
## 
## Acacia mabellae      C. fraseri  Acmena smithii   B. myrtifolia 
##              16              12              26              11

  1. The Acmena data.frame is created from rainforest using the function subset.
Acmena <- subset(rainforest, species == "Acmena smithii")

  1. Create a vector of the positive odd integers less than 100 and remove the values greater than 60 and less than 80.

    • Result:
##  [1] 61 63 65 67 69 71 73 75 77 79

Operators

R's binary and logical operators will look very familiar to programmers. Note that binary operators work on vectors and matrices as well as scalars.

Arithmetic Operators

Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
x %% y modulus (x mod y) 5%%2 is 1
x %/% y integer division 5%/%2 is 2

Logical Operators

Operator Description
< less than
> greater than
<= less or equal to
>= greater or equal to
== exactly equal to
!= not equal to
!x Not x
x|y x OR y
x&y x AND y
isTRUE(x) test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
## [1]  1  2  3  4  9 10
# yields 1 2 3 4 9 10

# How it works
x <- c(1:10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x > 8
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
x < 5
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x > 8 | x < 5
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
x[c(T,T,T,T,F,F,F,F,T,T)]
## [1]  1  2  3  4  9 10

Control Structures

These allow you to control the flow of execution of a script typically inside of a function. Common ones include:

Conditional Executions

if

if (condition) {
    # do something
} else {
    # do something else
}

e.g.:

x <- 1:15
if (sample(x, 1) <= 10) { # ?sample  
    print("x is less than 10")
} else {
    print("x is greater than 10")
}
## [1] "x is less than 10"

Vectorization with ifelse

ifelse(x <= 10, "x less than 10", "x greater than 10")

Other valid ways of writing if/else

if (sample(x, 1) < 10) {
    y <- 5
} else {
    y <- 0
}
y <- if (sample(x, 1) < 10) {
    5
} else {
    0
}

for

A for loop works on an iterable variable and assigns successive values till the end of a sequence.

for (i in 1:10) {
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
x <- c("apples", "oranges", "bananas", "strawberries")

for (i in x) {
    print(x[i])
}
## [1] NA
## [1] NA
## [1] NA
## [1] NA
for (i in 1:4) {
    print(x[i])
}
## [1] "apples"
## [1] "oranges"
## [1] "bananas"
## [1] "strawberries"
for (i in seq(x)) {
    print(x[i])
}
## [1] "apples"
## [1] "oranges"
## [1] "bananas"
## [1] "strawberries"
for (i in 1:4) print(x[i])
## [1] "apples"
## [1] "oranges"
## [1] "bananas"
## [1] "strawberries"

nested lopps

m <- matrix(1:10, 2)
for (i in seq(nrow(m))) {
    for (j in seq(ncol(m))) {
        print(m[i, j])
    }
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10

while

i <- 1
while (i < 10) {
    print(i)
    i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9

Be sure there is a way to exit out of a while loop.

Repeat and break

repeat {
    # simulations; generate some value have an expectation if within some range,
    # then exit the loop
    if ((value - expectation) <= threshold) {
        break
    }
}

Next

for (i in 1:20) {
    if (i%%2 == 1) {  # %% is the modulus
        next
    } else {
        print(i)
    }
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20

Comparison Operators

  "hola" == "hola"
## [1] TRUE
  "hola" == "Hola"
## [1] FALSE
   1 == 2-1
## [1] TRUE
    a <- c(1,2,4,5)
    b <- c(1,2,3,5) 
    a == b
## [1]  TRUE  TRUE FALSE  TRUE
    a != b
## [1] FALSE FALSE  TRUE FALSE
set.seed(1)
a <- rnorm(10)
b <- rnorm(10)
a<b
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
set.seed(2)
a <- rnorm(10)
b <- rnorm(10)
a >= b
##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
set.seed(3)
which(a>b)
## [1] 3 6 9
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
which(LETTERS=="R")
## [1] 18
set.seed(4)
a <- rnorm(10)
a
##  [1]  0.2167549 -0.5424926  0.8911446  0.5959806  1.6356180  0.6892754
##  [7] -1.2812466 -0.2131445  1.8965399  1.7768632
which.min(a)
## [1] 7
which.max(a)
## [1] 9
 a[2] <- NA
is.na(a)
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
which(is.na(a))
## [1] 2

Logical Operators

z = 1:6
which(2 < z & z > 3)
## [1] 4 5 6
z = 1:6
(z > 2) & (z < 5)
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
which((z > 2) & (z < 5))
## [1] 3 4
x <- c(TRUE,FALSE,0,6)
y <- c(FALSE,TRUE,FALSE,TRUE)

!x
## [1] FALSE  TRUE  TRUE FALSE

Operators & and | perform element-wise operation producing the result having the length of the longer operand. But && and || examines only the first element of the operands resulting into a single length logical vector. Zero is considered FALSE and non-zero numbers are taken as TRUE.

Example: - && vs &

x&y
## [1] FALSE FALSE FALSE  TRUE
x&&y
## [1] FALSE
x||y
## [1] TRUE
x|y
## [1]  TRUE  TRUE FALSE  TRUE

if statements

if(cond1=true) { cmd1 } else { cmd2 }

if(1==0) {
    print(1)
} else {
    print(2)
}
## [1] 2

ifelse statement

ifelse(test, true_value, false_value)

x <- 1:10 # Creates sample data
ifelse(x<5 | x>8, x, 0)
##  [1]  1  2  3  4  0  0  0  0  9 10

Loops

The most commonly used loop structures in R are for, while and apply loops. Less common are repeat loops. The break function is used to break out of loops, and next halts the processing of the current iteration and advances the looping index.

Writing a simple for loop in R

Suppose you want to do several printouts of the following form: The year is [year] where [year] is equal to 2010, 2011, up to 2015. You can do this as follows:

print(paste("The year is", 2010))
## [1] "The year is 2010"

for

For loops are controlled by a looping vector. In every iteration of the loop one value in the looping vector is assigned to a variable that can be used in the statements of the body of the loop. Usually, the number of loop iterations is defined by the number of values stored in the looping vector and they are processed in the same order as they are stored in the looping vector.

Syntax

for(variable in sequence) {
    statements
}
for (j in 1:5)
{
  print(j^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25

Repeat the loop saving the resuls in a vector x.

n = 5
x = NULL  # creates a NULL object
for (j in 1:n)
{
  x[j] = j^2
}
x
## [1]  1  4  9 16 25

Let's use a for loop to estimate the average of squaring the result of a roll of a dice.

nsides = 6
ntrials = 1000
trials = NULL
for (j in 1:ntrials)
{
  trials[j] = sample(1:nsides,1)  # We get one sample at a time
}
mean(trials^2)
## [1] 14.563

Example: stop on condition and print error message

x <- 1:10
z <- NULL
for(i in seq(along=x)) {
    if (x[i]<5) {
        z <- c(z,x[i]-1) 
    } else {
        stop("values need to be <5")
    }
}
## Error: values need to be <5
z
## [1] 0 1 2 3

while

Similar to for loop, but the iterations are controlled by a conditional statement.

z <- 0
while(z < 5) {
    z <- z + 2
    print(z) 
}
## [1] 2
## [1] 4
## [1] 6

apply loop family

&nsbp;

For Two-Dimensional Data Sets: apply

Syntax:

apply(X, MARGIN, FUN, ARGs)

X: array, matrix or data.frame; MARGIN: 1 for rows, 2 for columns, c(1,2) for both; FUN: one or more functions; ARGs: possible arguments for function.

## Example for applying predefined mean function
apply(mtcars[,1:3], 1, mean)

## With custom function
x <- 1:10
test <- function(x) { # Defines some custom function
    if(x < 5) {
        x-1
    } else {
        x / x
    }
} 

apply(as.matrix(x), 1, test) 

## Same as above but with a single line of code
apply(as.matrix(x), 1, function(x) { if (x<5) { x-1 } else { x/x } })

For Ragged Arrays: tapply

Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

## Computes mean values of vector agregates defined by factor
tapply(as.vector(mtcars$mpg), factor(mtcars$cyl), mean)
##        4        6        8 
## 26.66364 19.74286 15.10000
## The aggregate function provides related utilities
aggregate(mtcars[,c(1,3,4)], list(mtcars$cyl), mean)
##   Group.1      mpg     disp        hp
## 1       4 26.66364 105.1364  82.63636
## 2       6 19.74286 183.3143 122.28571
## 3       8 15.10000 353.1000 209.21429

For Vectors and Lists: lapply and sapply

Both apply a function to vector or list objects. The function lapply returns a list, while sapply attempts to return the simplest data object, such as vector or matrix instead of list.

Syntax

lapply(X,FUN)
sapply(X,FUN)
## Creates a sample list
mylist <- as.list(mtcars[,c(1,4,6)])
mylist
## $mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
## 
## $hp
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230
## [18]  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109
## 
## $wt
##  [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
## [12] 4.070 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520
## [23] 3.435 3.840 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

Compute sum of each list component and return result as list

lapply(mylist, sum)
## $mpg
## [1] 642.9
## 
## $hp
## [1] 4694
## 
## $wt
## [1] 102.952

Compute sum of each list component and return result as vector

sapply(mylist, sum)
##      mpg       hp       wt 
##  642.900 4694.000  102.952

Other Loops

Repeat Loop

Syntax

repeat statements

Loop is repeated until a break is specified. This means there needs to be a second statement to test whether or not to break from the loop.

Example:

z <- 0
repeat {
    z <- z + 1
    print(z)
    if(z > 100) break()
}

Improving Speed Performance of Loops

Looping over very large data sets can become slow in R. However, this limitation can be overcome by eliminating certain operations in loops or avoiding loops over the data intensive dimension in an object altogether. The latter can be achieved by performing mainly vector-to-vector or matrix-to-matrix computations which run often over 100 times faster than the corresponding for() or apply() loops in R. For this purpose, one can make use of the existing speed-optimized R functions (e.g.: rowSums, rowMeans, table, tabulate) or one can design custom functions that avoid expensive R loops by using vector- or matrix-based approaches. Alternatively, one can write programs that will perform all time consuming computations on the C-level.

  1. Speed comparison of for loops with an append versus and inject step
N <- 1e3
myMA <- matrix(rnorm(N), N, 10, dimnames=list(1:N, paste("C", 1:10, sep="")))
results <- NULL
system.time(for(i in seq(along=myMA[,1])) 
            results <- c(results, mean(myMA[i,])))

results <- numeric(length(myMA[,1]))
system.time(for(i in seq(along=myMA[,1])) 
            results[i] <- mean(myMA[i,]))

The inject approach is 20-50 times faster than the append version.

  1. Speed comparison of apply loop versus rowMeans for computing the mean for each row in a large matrix:
system.time(myMAmean <- apply(myMA, 1, mean))
system.time(myMAmean <- rowMeans(myMA))

The rowMeans approach is over 200 times faster than the apply loop.

Scatterplots

library(MASS)
data("mammals")
?mammals
head(mammals)
##                    body brain
## Arctic fox        3.385  44.5
## Owl monkey        0.480  15.5
## Mountain beaver   1.350   8.1
## Cow             465.000 423.0
## Grey wolf        36.330 119.5
## Goat             27.660 115.0
attach(mammals)
species <- row.names(mammals)
x <- body
y <- brain
library(calibrate)
# scatterplot
plot(x,y, xlab = "body weight in kgr", ylab = "brain weight in gr", 
     main="Body vs Brain weight \n for 62 Species of Land Mammals",xlim=c(0,8500))
textxy(x,y,labs=species,col = "blue",cex=0.85) 

Identify a point in the scatterplot

identify(x,y,species)

Plot in the log scale

plot(log(x),log(y), xlab = "log body weight in kgr", ylab = "log brain weight in gr", 
     main="log Body vs log Brain weight \n for 62 Species of Land Mammals")
textxy(log(x),log(y),labs=species,col = "blue",cex=0.85) 

Identify a point in the log scale scatterplot

identify(log(x),log(y),species)

Multiple Data Sets on One Plot

One common task is to plot multiple data sets on the same plot. In many situations, the way to do this is to create the initial plot and then add additional information to the plot. For example, to plot bivariate data the plot command is used to initialise and create the plot. The points command can then be used to add additional datasets to the plot.

set.seed(1234)
 x <- rnorm(10,sd=5,mean=20)
 y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
 cor(x,y)
## [1] 0.7512194
 plot(x,y,xlab="Independent",ylab="Dependent",main="Random plot")
 x1 <- runif(8,15,25)
 y1 <- 2.5*x1 - 1.0 + runif(8,-6,6)
 points(x1,y1,col=2)

with legend and \((x_2,y_2)\) points:

set.seed(1234)
x2 <- runif(8,15,25)
y2 <- 2.5*x2 - 1.0 + runif(8,-6,6)
 plot(x,y,xlab="Independent",ylab="Dependent",main="Random plot")
 points(x1,y1,col=2,pch=3)
 points(x2,y2,col=4,pch=5)
 legend("topleft",c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))

Multiple Graphs on One Image:

set.seed(1234)
 par(mfrow=c(2,3))
 boxplot(rnorm(100),main="first plot")
 boxplot(rgamma(100,2),main="second plot", horizontal=TRUE,col="bisque")
 plot(rnorm(100),xlab="third plot",
      ylab="y-label",main="x-label")
 hist(rnorm(100),main="fourth plot",col="lightgrey")
 hist(rexp(100),main="fifth plot",col="blue")
 plot(rnorm(100),rexp(100),main="sixth plot")

Pairwise relationships

uData <- rnorm(20)
vData <- rnorm(20,mean=5)
wData <- uData + 2*vData + rnorm(20,sd=0.5)
xData <- -2*uData+rnorm(20,sd=0.1)
yData <-  3*vData+rnorm(20,sd=2.5)
d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
pairs(d)

Plotting correlations

The function corrplot in the library(corrplot) visualizes a correlation matrix calculate with function cor

library(corrplot)
## corrplot 0.84 loaded
M <- cor(d)
corrplot(M, method="circle",type="upper")

Plotting surfaces: image, contour and persp plots

x <- seq(0,2*pi,by=pi/50)
y <- x
xg <- (x*0+1) %*% t(y)
yg <- (x) %*% t(y*0+1)
f <- sin(xg*yg)

par(mfrow=c(2,2))
image(x,y,f)
contour(x,y,f)
contour(x,y,f,nlevels=4)
image(x,y,f,col=grey.colors(100))
contour(x,y,f,nlevels=4,add=TRUE,col="red")

Similarly, one can use persp plot

persp(x,y,f,theta=-30,phi=55,col="lightgrey",shade=.01)

Or plot images

library(fields)
data(lennon)
image(lennon,col=grey(seq(0,1,l=256)))

Tables and Cross-classification

library(MASS)
data(quine)
?quine
attach(quine)
table(Sex)
## Sex
##  F  M 
## 80 66
table(Sex,Age)
##    Age
## Sex F0 F1 F2 F3
##   F 10 32 19 19
##   M 17 14 21 14
# or xtabs
xtabs(~Sex+Age,data=quine)
##    Age
## Sex F0 F1 F2 F3
##   F 10 32 19 19
##   M 17 14 21 14
xtabs(~Sex+Age+Eth,data=quine)
## , , Eth = A
## 
##    Age
## Sex F0 F1 F2 F3
##   F  5 15  9  9
##   M  8  5 11  7
## 
## , , Eth = N
## 
##    Age
## Sex F0 F1 F2 F3
##   F  5 17 10 10
##   M  9  9 10  7

Calculations of cross-classifications

tapply(Days,Age,mean)
##       F0       F1       F2       F3 
## 14.85185 11.15217 21.05000 19.60606
tapply(Days,list(Sex,Age),mean)
##         F0       F1       F2       F3
## F 18.70000 12.96875 18.42105 14.00000
## M 12.58824  7.00000 23.42857 27.21429
tapply(Days,list(Sex,Age),function(x) sqrt(var(x)/length(x)))
##         F0       F1       F2       F3
## F 4.208589 2.329892 5.299959 2.940939
## M 3.768151 1.418093 3.766122 4.569582

Qualitative data

A data sample is called qualitative, also known as categorical if its values belong to a collection of known defined non-overlapping classes.

Let us consider some artificial data consisting of the treatment and improvement of patients with rheumatoid arthritis.

treatment <- factor(rep(c(1, 2), c(43, 41)), levels = c(1, 2),
                    labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)),
                   levels = c(1, 2, 3),
                   labels = c("none", "some", "marked"))

We can compute a cross-classification table

xtabs(~treatment+improved)
##          improved
## treatment none some marked
##   placebo   29    7      7
##   treated   13    7     21

Graphically,

spineplot(improved ~ treatment)

The R dataset UCBAdmissions contains aggregated data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

data("UCBAdmissions")
?UCBAdmissions
apply(UCBAdmissions, c(2,1), sum)
##         Admit
## Gender   Admitted Rejected
##   Male       1198     1493
##   Female      557     1278
prop.table(apply(UCBAdmissions, c(2,1), sum))
##         Admit
## Gender    Admitted  Rejected
##   Male   0.2646929 0.3298719
##   Female 0.1230667 0.2823685
ftable(UCBAdmissions)
##                 Dept   A   B   C   D   E   F
## Admit    Gender                             
## Admitted Male        512 353 120 138  53  22
##          Female       89  17 202 131  94  24
## Rejected Male        313 207 205 279 138 351
##          Female       19   8 391 244 299 317

The same but with a more readable format can be obtained using ftable

ftable(round(prop.table(UCBAdmissions), 3),
       row.vars="Dept", col.vars = c("Gender", "Admit"))
##      Gender     Male            Female         
##      Admit  Admitted Rejected Admitted Rejected
## Dept                                           
## A              0.113    0.069    0.020    0.004
## B              0.078    0.046    0.004    0.002
## C              0.027    0.045    0.045    0.086
## D              0.030    0.062    0.029    0.054
## E              0.012    0.030    0.021    0.066
## F              0.005    0.078    0.005    0.070

More interesting are the proportions admitted for each Gender by Dept combination (dimensions 2 and 3 of the array). Notice that male and female admission rates are about the same in all departments, except "A", where female admission rates are higher.

# prop.table(UCBAdmissions, c(2,3))
ftable(round(prop.table(UCBAdmissions, c(2,3)), 2),
       row.vars="Dept", col.vars = c("Gender", "Admit"))
##      Gender     Male            Female         
##      Admit  Admitted Rejected Admitted Rejected
## Dept                                           
## A               0.62     0.38     0.82     0.18
## B               0.63     0.37     0.68     0.32
## C               0.37     0.63     0.34     0.66
## D               0.33     0.67     0.35     0.65
## E               0.28     0.72     0.24     0.76
## F               0.06     0.94     0.07     0.93
## Data aggregated over departments
apply(UCBAdmissions, c(1, 2), sum)
##           Gender
## Admit      Male Female
##   Admitted 1198    557
##   Rejected 1493   1278

Applications and admissions by department at UC Berkeley can be viewed graphically.

spineplot(margin.table(UCBAdmissions, c(3, 2)),
           main = "Applications at UCB")

spineplot(margin.table(UCBAdmissions, c(3, 1)),
           main = "Admissions at UCB")

This data set is frequently used for illustrating Simpson's paradox. At issue is whether the data show evidence of sex bias in admission practices. There were 2691 male applicants, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants of whom 557 (30.4%) were admitted. Men were much more successful in admissions than women. Wikipedia: Gender Bias UC Berkeley. See animation at link

Quantitative data

Quantitative data, also known as continuous data, consists of numeric data that support arithmetic operations.

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

It consists of a collection of observations of the Old Faithful geyser in the USA Yellowstone National Park.  

There are two observation variables in the dataset. The first one, called eruptions, is the duration of the geyser eruptions. The second one, called waiting, is the length of waiting period until the next eruption. It turns out there is a correlation between the two variables.

plot(faithful)

Frequency distribution of quantitative data

The frequency distribution of a data variable is a summary of the data occurrence in a collection of non-overlapping categories.

Let us find the frequency distribution of the eruption duration in faithful data set.

duration <- faithful$eruptions
range(duration)
## [1] 1.6 5.1

 

Now we create the range of non-overlapping sub-intervals by defining a sequence of equal distance break points.

If we round the endpoints of the interval [1.6, 5.1] to the closest half-integers, we come up with the interval [1.5, 5.5]. Hence we set the breakpoints to be the half-integer sequence { 1.5, 2.0, 2.5, ... }.

breaks <- seq(1.5,5.5,by=0.5)
breaks
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

 

Classify the eruption durations according to the half-unit-length sub-intervals with cut. As the intervals are to be closed on the left, and open on the right, we set the right argument to FALSE.

duration.cut = cut(duration, breaks, right=FALSE) 

Compute the frequency of eruptions in each sub-interval with the table function.

duration.freq = table(duration.cut) 
duration.freq
## duration.cut
## [1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5) [5,5.5) 
##      51      41       5       7      30      73      61       4

hist function does all the computaions to find the frequency distribution:

freq <- hist(duration)
freq
## $breaks
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
## 
## $counts
## [1] 55 37  5  9 34 75 54  3
## 
## $density
## [1] 0.40441176 0.27205882 0.03676471 0.06617647 0.25000000 0.55147059
## [7] 0.39705882 0.02205882
## 
## $mids
## [1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
## 
## $xname
## [1] "duration"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
freq <- hist(duration,breaks = breaks)

hist(duration,50)

Density estimation builds an estimate of some underlying probability density function using an observed data sample.

library(graphics)
d <- density(faithful$eruptions)
d
## 
## Call:
##  density.default(x = faithful$eruptions)
## 
## Data: faithful$eruptions (272 obs.); Bandwidth 'bw' = 0.3348
## 
##        x                y            
##  Min.   :0.5957   Min.   :0.0002262  
##  1st Qu.:1.9728   1st Qu.:0.0514171  
##  Median :3.3500   Median :0.1447010  
##  Mean   :3.3500   Mean   :0.1813462  
##  3rd Qu.:4.7272   3rd Qu.:0.3086071  
##  Max.   :6.1043   Max.   :0.4842095
plot(d)

  Two dimension histogram:

library(gplots)
h2 <- hist2d(faithful, nbins=30,xlab="Duration in minutes",ylab="Waiting")

h2
## 
## ----------------------------
## 2-D Histogram Object
## ----------------------------
## 
## Call: hist2d(x = faithful, nbins = 30, xlab = "Duration in minutes", 
##     ylab = "Waiting")
## 
## Number of data points:  272 
## Number of grid bins:  30 x 30 
## X range: ( 1.6 , 5.1 )
## Y range: ( 43 , 96 )
names(h2)
## [1] "counts"   "x.breaks" "y.breaks" "x"        "y"        "nobs"    
## [7] "call"

 

Relative frequencies

duration.relfreq <- duration.freq / nrow(faithful) 
tab <- cbind(duration.freq, duration.relfreq) 
apply(tab,2,sum)
##    duration.freq duration.relfreq 
##              272                1

 

Cumulative frequency distribution

cumsum(duration.freq)
## [1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5) [5,5.5) 
##      51      92      97     104     134     207     268     272
cumsum(duration.relfreq)
##   [1.5,2)   [2,2.5)   [2.5,3)   [3,3.5)   [3.5,4)   [4,4.5)   [4.5,5) 
## 0.1875000 0.3382353 0.3566176 0.3823529 0.4926471 0.7610294 0.9852941 
##   [5,5.5) 
## 1.0000000

Bivariante Density estimation:

data("faithful")
attach(faithful)
Dens2d<-kde2d(eruptions,waiting)
image(Dens2d,xlab="eruptions",ylab="waiting")
contour(Dens2d,add=TRUE,col="black",lwd=2,nlevels=5)

detach("faithful")

Perspective plot:

persp(Dens2d,phi=30,theta=20,d=5,xlab="eruptions",ylab="waiting",zlab="",shade=.2,col="lightblue",expand=.85,ticktype = "detailed")

Advanced plotting ggplot2

library(ggplot2)

Why ggplot2?

Advantages of ggplot2

Example: Housing data download

housing <- read.csv("data/landdata-states.csv")
head(housing[1:5])
##   State region    Date Home.Value Structure.Cost
## 1    AK   West 2010.25     224952         160599
## 2    AK   West 2010.50     225511         160252
## 3    AK   West 2009.75     225820         163791
## 4    AK   West 2010.00     224994         161787
## 5    AK   West 2008.00     234590         155400
## 6    AK   West 2008.25     233714         157458
# change column names
names(housing)[names(housing) == "Land.Share..Pct."] <- "Land.Share.Pct"
head(housing, 10)
##    State region    Date Home.Value Structure.Cost Land.Value
## 1     AK   West 2010.25     224952         160599      64352
## 2     AK   West 2010.50     225511         160252      65259
## 3     AK   West 2009.75     225820         163791      62029
## 4     AK   West 2010.00     224994         161787      63207
## 5     AK   West 2008.00     234590         155400      79190
## 6     AK   West 2008.25     233714         157458      76256
## 7     AK   West 2008.50     232999         160092      72906
## 8     AK   West 2008.75     232164         162704      69460
## 9     AK   West 2009.00     231039         164739      66299
## 10    AK   West 2009.25     229395         165424      63971
##    Land.Share.Pct Home.Price.Index Land.Price.Index Year Qrtr
## 1            28.6            1.481            1.552 2010    1
## 2            28.9            1.484            1.576 2010    2
## 3            27.5            1.486            1.494 2009    3
## 4            28.1            1.481            1.524 2009    4
## 5            33.8            1.544            1.885 2007    4
## 6            32.6            1.538            1.817 2008    1
## 7            31.3            1.534            1.740 2008    2
## 8            29.9            1.528            1.660 2008    3
## 9            28.7            1.521            1.587 2008    4
## 10           27.9            1.510            1.536 2009    1

ggplot2 VS Base for simple graphs

Base graphics histogram are:

hist(housing$Home.Value)

library(ggplot2)
ggplot(housing, aes(x = Home.Value)) +
  geom_histogram()

Another simple graph

plot(Home.Value ~ Date,
     data=subset(housing, State == "MA"))
points(Home.Value ~ Date, col="red",
       data=subset(housing, State == "TX"))
legend(1975, 400000,
       c("MA", "TX"), title="State",
       col=c("black", "red"),
       pch=c(1, 1))

ggplot version, colored scatter plot example:

ggplot(subset(housing, State %in% c("MA", "TX")),
       aes(x=Date,
           y=Home.Value,
           color=State))+
  geom_point()

Geometric Objects And Aesthetics

*Aesthetic Mapping:*

In ggplot land /aesthetic/ means "something you can see". Examples include:

Each type of geom accepts only a subset of all aesthetics--refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.

Geometric Objects (geom)

Geometric objects are the actual marks we put on a plot.

Examples include:

More at http://had.co.nz/ggplot2/

Some examples:

library(ggplot2)
?qplot
qplot(displ, hwy, data = mpg, colour = factor(cyl))

qplot(mpg, wt, data = mtcars)

qplot(mpg, wt, data = mtcars, colour = cyl)

qplot(mpg, wt, data = mtcars, size = cyl)

qplot(mpg, wt, data = mtcars, size = cyl, alpha = I(0.7))

qplot(mpg, wt, data = mtcars, facets = vs ~ am)

qplot(displ, hwy, data=mpg, facets = . ~ year) + geom_smooth()

p <- ggplot(mtcars)
p <- p + aes(wt, hp)
p + geom_point(aes(colour = factor(cyl)))

p <- ggplot(mtcars, aes(mpg, wt))
p + geom_point(colour = "darkblue")

Get data from the internet

filepath <- "http://idaejin.github.io/courses/R/2018/data/ggplot2_data.txt"

myData<-read.table(file=url(filepath),header=TRUE,sep="\t")

str(myData)

qplot(data=myData,x=BM,main="Histogram of BodyMass")

qplot(data=myData,x=BM,y=var1,log="xy",color=Tribe)

Maps

Example: Malignant Melanoma in the USA

Fisher and Belle (1993) report mortality rates due to malignant melanoma of the skin for white males during the period 1950-1969, for each state on the US mainland.

data("USmelanoma",package="HSAUR2")
head(USmelanoma)
##             mortality latitude longitude ocean
## Alabama           219     33.0      87.0   yes
## Arizona           160     34.5     112.0    no
## Arkansas          170     35.0      92.5    no
## California        182     37.5     119.5   yes
## Colorado          149     39.0     105.5    no
## Connecticut       159     41.8      72.8   yes

A data consists of 48 observations on the following 5 variables.

Plotting mortality rates

xr <- range(USmelanoma$mortality) * c(0.9, 1.1)

Let us plot mortality rates in

#layout(matrix(1:2, nrow = 2))
boxplot(USmelanoma$mortality, ylim = xr, horizontal = TRUE,xlab = "Mortality")

hist(USmelanoma$mortality, xlim = xr, xlab = "", main = "",axes = FALSE, ylab = "")
axis(1)

Malignant melanoma mortality rates by contiguity to an ocean

plot(mortality ~ ocean, data = USmelanoma, 
     xlab = "Contiguity to an ocean", ylab = "Mortality")

Histograms can often be misleading for displaying distributions because of their dependence on the number of classes chosen. An alternative is to formally estimate the density function of a variable and then plot the resulting estimate.

The estimated densities of malignant melanoma mortality rates by contiguity to an ocean looks like this:

dyes<- with(USmelanoma, density(mortality[ocean == "yes"]))
dno <- with(USmelanoma, density(mortality[ocean == "no"]))
plot(dyes, lty = 1, xlim = xr, main = "", ylim = c(0, 0.018))
lines(dno, lty = 2)
legend("topright", lty = 1:2, legend = c("Coastal State","Land State"), bty = "n")

Now we might move on to look at how mortality rates are related to the geographic location of a state as represented by the latitude and longitude of the centre of the state.

layout(matrix(1:2, ncol = 2))
plot(mortality ~ -longitude, data = USmelanoma)
plot(mortality ~ latitude, data = USmelanoma)

Mapping mortality rates

The data contains the longitude and latitude of the centroids

plot(-USmelanoma$longitude,USmelanoma$latitude,
     asp=1.5,cex=.3,pch=19,col="blue")

library("sp")
library("maps")
library("maptools")
library("RColorBrewer")
map("state")
points(-USmelanoma$longitude,USmelanoma$latitude,asp=1.5,cex=.3,pch=19,col="blue")

#qplot(-USmelanoma$longitude,USmelanoma$latitude,colour=USmelanoma$mortality,asp=1.5)+scale_color_gradient(low="blue", high="red")+geom_point()

#Create a function to generate a continuous color palette
rbPal <- colorRampPalette(c('blue','grey','red'))
#This adds a column of color values
# based on the y values
USmelanoma$Col <- (rbPal(10)[as.numeric(cut(USmelanoma$mortality,breaks = 10))])
map("state",xlim=c(-135,-65))
points(-USmelanoma$longitude,USmelanoma$latitude,col=USmelanoma$Col,asp=1.5,pch=19,cex=1.2)
legend("topleft",title="Decile",legend=quantile(USmelanoma$mortality,
        seq(0.1,1,l=10)),col =rbPal(10),pch=15,cex=1.,box.col = NA)

states <- map("state", plot = FALSE, fill = TRUE)
IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
rownames(USmelanoma) <- tolower(rownames(USmelanoma))

us1 <- map2SpatialPolygons(states, IDs=IDs,proj4string = CRS("+proj=longlat +datum=WGS84"))
us2 <- SpatialPolygonsDataFrame(us1, USmelanoma)

col <- colorRampPalette(c('blue', 'gray80','red'))

spplot(us2, "mortality", col.regions = col(200),
       par.settings = list(axis.line = list(col =  'transparent')),
       main="Map of the US showing malignant melanoma mortality rates")

Packages for Spatial Regression / Geostatistics / Spatial Point Pattern methods

install.packages(c("sp","maptools","spatstat","maps"))
library(maps)

Basic syntax

map(database = "world",regions=".")

Databases are available for US, France, Italy and New Zealand. For other countries, you need to import a database with the corresponding map.

map(database = "usa")

map("state")

ggmap offers plotting capabilities like ggplot2

library(ggmap)
geocode("Bilbao, Spain")
##   lon lat
## 1  NA  NA